arXiv:1504.08025v2 [cs.LG] 18 Jun 2016


Note on Equivalence Between Recurrent Neural Network 
Time Series Models and Variational Bayesian Models 


Jascha Sohl-Dickstein 


JASCHA@STANFORD.EDU 


Diederik P. Kingma 


DPKINGMA@UVA.NL


Abstract 

We observe that the standard log likelihood training objective for a Recurrent Neural Network (RNN) model of 
time series data is equivalent to a variational Bayesian training objective, given the proper choice of generative 
and inference models. This perspective may motivate extensions to both RNNs and variational Bayesian models. 
We propose one such extension, where multiple particles are used for the hidden state of an RNN, allowing a 
natural representation of uncertainty or multimodality. 


1. Recurrent Neural Networks (RNNs)

1.1. RNN definition 

A Recurrent Neural Network (RNN) [3] has a visible state $x^t$ at each time step, and a corresponding hidden state $h^t$. The
dynamics can be described in terms of two distributions, $p(h^t | h^{t-1}, x^{t-1})$ and $p(x^t | h^t)$. Typically the state of the hidden
units $h^t$ is deterministic given $x^{t-1}$ and the values at the previous timestep, such that $h^t = f(h^{t-1}, x^{t-1})$. Taking
slight liberties with notation, we indicate the distribution of the hidden units given their parents as:

$$p(h^t | h^{t-1}, x^{t-1}) = \delta\left(h^t - f(h^{t-1}, x^{t-1})\right), \tag{1}$$

$$p(H | X) = \delta\left(H - F(X)\right), \tag{2}$$

where $X = \{x^1 \cdots x^T\}$ and $H = \{h^1 \cdots h^T\}$ are the full trajectories over visible and hidden units. The initial hidden
vector is included as a model parameter.
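
A minimal sketch of the deterministic hidden dynamics described above may help make the map from a visible trajectory to a hidden trajectory concrete. It assumes a tanh transition with hypothetical weights `W`, `U`, `b`, and an assumed input `x0` standing in for the visible state before the first time step; none of these choices are specified in the text.

```python
import math

def f(h_prev, x_prev, W, U, b):
    """Deterministic transition h^t = f(h^{t-1}, x^{t-1}).

    The tanh form and the weights W, U, b are illustrative assumptions.
    """
    return [
        math.tanh(
            sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(U[i][k] * x_prev[k] for k in range(len(x_prev)))
            + b[i]
        )
        for i in range(len(b))
    ]

def rollout(xs, h0, x0, W, U, b):
    """Return the hidden trajectory [h^1, ..., h^T] implied by xs = [x^1, ..., x^T].

    h0 is the initial hidden vector (a model parameter); x0 is an assumed
    placeholder for the visible state preceding the first time step.
    """
    hs = []
    h_prev, x_prev = h0, x0
    for x in xs:
        h = f(h_prev, x_prev, W, U, b)  # h^t depends only on h^{t-1}, x^{t-1}
        hs.append(h)
        h_prev, x_prev = h, x
    return hs
```

Because every step is a deterministic function of its parents, calling `rollout` twice on the same inputs returns the same trajectory, which is the sense in which the distribution over hidden trajectories collapses to a single point.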


1.2. RNN training 


Training is typically performed by maximizing the log likelihood of this model, computed from a data distribution over
trajectories $q(X)$. Note that this empirical data distribution $q(X)$ is most often simply a collection of datapoints, i.e. Dirac
delta peaks.
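
As a rough illustration of this objective, the sketch below averages the sequence log likelihood over a finite dataset, which is what an expectation under an empirical $q(X)$ of delta peaks amounts to. The scalar states, the tanh transition, the unit-variance Gaussian emission, and the parameter names `w`, `u`, `v`, `h0` are all assumptions made for the example, not choices taken from the text.

```python
import math

def log_gauss(x, mean, std=1.0):
    """Log density of a Gaussian; used here as an assumed emission model p(x^t | h^t)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def sequence_log_lik(xs, params):
    """log p(X) for one trajectory, with the hidden states computed deterministically."""
    w, u, v, h0 = params          # hypothetical scalar parameters
    h_prev, x_prev, total = h0, 0.0, 0.0
    for x in xs:
        h = math.tanh(w * h_prev + u * x_prev)   # h^t = f(h^{t-1}, x^{t-1})
        total += log_gauss(x, v * h)             # log p(x^t | h^t)
        h_prev, x_prev = h, x
    return total

def objective(dataset, params):
    """Average log likelihood over the dataset, i.e. the expectation under empirical q(X)."""
    return sum(sequence_log_lik(xs, params) for xs in dataset) / len(dataset)
```

Maximizing `objective` with respect to `params` (for instance by gradient ascent) would correspond to the training procedure described above.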

2. Variational Bayesian Perspective

Variational Bayesian methods where both the generative and inference models are trained against each other have recently 
proven very powerful for building probabilistic models of arbitrary distributions [8, 4, 2, 7, 6, 9, 1]. Here we show how an 
RNN can be interpreted using a variational Bayesian framework. 

2.1. Inference model

We take $p(X, H)$ from Section 1.1 to be the ‘generative’ model, and introduce an ‘inference’ model $q(H|X)$. We set the
inference model to be identical to the corresponding conditional distribution in the generative model.

2.2. Log likelihood bound

We now derive a variational bound $K$ on the data log-likelihood $L = \mathbb{E}_{q(X)}[\log p(X)]$ for these generative and inference
distributions.

The bound in Equation 15 is identical to the RNN training objective in Equation 5. Therefore, for the choice of inference
model in Equation 6, the variational Bayesian training objective is identical to the standard log likelihood training objective. 
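
The tightness of the bound can also be read off from the standard decomposition of the log likelihood. The block below is a sketch of that textbook identity in generic form, not a reproduction of the numbered equations in this section, and it treats delta-function densities with the same notational liberty as Equation 2.

```latex
\begin{aligned}
\log p(X) &= \int dH \, q(H|X) \log \frac{p(X, H)}{q(H|X)}
            + D_{KL}\left( q(H|X) \,\|\, p(H|X) \right), \\
D_{KL}\left( q(H|X) \,\|\, p(H|X) \right) &= 0
            \quad \text{when } q(H|X) = p(H|X).
\end{aligned}
```

Taking the expectation over $q(X)$, the first term on the right gives the bound $K$ and the left side gives $L$, so $K = L$ whenever the KL term vanishes.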


2.3. Optimality of noise-free hidden dynamics 

Often, noise-free dynamics are optimal with respect to $K$. If the latent dynamics $p(h^t | h^{t-1}, x^{t-1})$ are in the location-scale family with
location $\mu(h^{t-1}, x^{t-1})$ and scale $\sigma$, we can parameterize the latent variables as $h^t = f(h^{t-1}, x^{t-1}) + \sigma \cdot \epsilon^t$, where $f$ is a
deterministic function and $\epsilon^t \sim p(\epsilon)$ is independent zero-centered noise per timestep. Equation 14 can be written in this
so-called non-centered form [5] as:

The inserted noise has the effect of removing information about previous states from the hidden state $h^t$, such that $x^t$
will be harder to predict. This contribution of the noise to $h^t$ can trivially be minimized by letting $\sigma \to 0$, i.e. by choosing
$p(H|X) = \delta(H - F(X))$ (Equation 15).
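
The non-centered parameterization can be sketched as follows, under several assumptions that are not in the text: scalar states, a tanh transition with hypothetical weights `w` and `u`, and standard Gaussian noise for $p(\epsilon)$ (the text only requires independent zero-centered noise).

```python
import math
import random

def noisy_rollout(xs, w, u, h0, sigma, rng):
    """Sample a hidden trajectory with h^t = f(h^{t-1}, x^{t-1}) + sigma * eps^t.

    rng is a random.Random instance; Gaussian noise is an assumed choice of p(eps).
    """
    h_prev, x_prev, hs = h0, 0.0, []
    for x in xs:
        eps = rng.gauss(0.0, 1.0)                         # independent noise per timestep
        h = math.tanh(w * h_prev + u * x_prev) + sigma * eps
        hs.append(h)
        h_prev, x_prev = h, x
    return hs
```

With `sigma = 0.0` the noise term drops out and every call returns the same trajectory regardless of the random seed, which is the deterministic limit discussed above. With `sigma > 0` each sampled $h^t$ carries less information about the preceding states.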


3. Discussion 

From one perspective it is a trivial observation that if $q(H|X) = p(H|X)$, then the variational Bayesian objective becomes
the true log likelihood objective. From another perspective, it is non-obvious and interesting that, due to its causal structure,
a recurrent neural network can be viewed simultaneously as an inference and generative model, and that the inference
model can be made trivially identical to the posterior of the generative model.

Note that the equivalence $q(H|X) = p(H|X)$ in Equation 8 relies on the true posterior distribution $p(H|X)$ having the
causal, factorial structure $p(H|X) = \prod_t p(h^t | x^{t-1}, h^{t-1})$. This structure stems from the deterministic dynamics of the
RNN, and would not hold in general if $p(h^t | h^{t-1}, x^{t-1})$ were not a delta function. In this case the variational bound on
the log likelihood in Equation 14 would continue to hold, but it would no longer be identical to the true log likelihood in
Equation 4.

This perspective on RNNs as consisting of matching inference and generative models may suggest natural extensions to 
the RNN framework, or novel model forms for variational Bayesian methods. As one example, it suggests the use of 
multiple inference particles in an RNN, which may allow more complex and multimodal distributions over visible units to 
be captured by simpler and lower dimensional hidden representations. 

3.1. Multiple particles 

RNNs are often called upon to represent a multimodal distribution $p(x^t | x^1 \cdots x^{t-1})$ (for instance, a distribution
over words in the context of language models). Since the hidden state of an RNN is deterministic, it must capture this
multimodal distribution using a single high dimensional vector $h^t$.

In variational inference, a multimodal posterior can be approximated using multiple samples from the inference model. 
This raises the possibility of training an RNN with a multimodal distribution over hidden units. This has the potential to 
reduce the required complexity of the RNN. Rather than forcing a high dimensional unimodal distribution over hidden 
units to represent a multimodal distribution over visible units, instead multiple modes in a lower dimensional hidden 
representation can be made to correspond to the multiple modes over the visible units. 

Specifically, the hidden state can be extended to consist of $L$ samples $H = \{H_1, H_2, \cdots, H_L\}$. As shown in Appendix A,
these multiple samples can be averaged over in the variational Bayesian framework. This leads to the following modification
of the training objective from Equation 15,

The multiple samples can be initialized with different (learned) initial vectors $h^0_l$, allowing them to explore different modes
despite being governed by the same deterministic dynamics.
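
A rough sketch of the particle idea follows: several hidden trajectories share the same deterministic dynamics and differ only in their initial values. The scalar states, tanh transition, unit-variance Gaussian emission, and parameter names are assumptions for illustration. In particular, combining the particles by a log-mean-exp over whole-sequence likelihoods is only one plausible way to average them and should not be read as the exact form of the modified objective.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def particle_log_lik(xs, w, u, v, h0s):
    """Run one deterministic trajectory per initial value in h0s and average them.

    Returns (combined, per_particle). The averaging rule is an assumption,
    as are the scalar tanh dynamics and the Gaussian emission.
    """
    per_particle = []
    for h0 in h0s:                       # one particle per learned initial value
        h_prev, x_prev, total = h0, 0.0, 0.0
        for x in xs:
            h = math.tanh(w * h_prev + u * x_prev)        # shared dynamics
            total += -0.5 * math.log(2 * math.pi) - 0.5 * (x - v * h) ** 2
            h_prev, x_prev = h, x
        per_particle.append(total)
    return log_mean_exp(per_particle), per_particle
```

When all initial values coincide, the particles follow identical trajectories and the combined value reduces to the single-particle log likelihood; distinct initial values let the particles settle into different trajectories.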







Appendix 

A. Multiple particles in variational Bayesian models

Here we modify the derivation in Section 2.2 to include multiple particles, leading to Equation 17. 

where the steps between Equations 26 and 27 parallel those in Section 2.2.